The pandas DataFrame object extends the capabilities of the Series object into two dimensions. A Series object adds an index to a NumPy array but can only associate a single data item per index label; a DataFrame integrates multiple Series objects by aligning them along common index labels.
A DataFrame object can be thought of as a dictionary-like container of one or more Series objects, as a spreadsheet, or even as a database table.
A DataFrame also introduces the concept of multiple axes, specifically the horizontal and vertical axes. pandas functions can be applied to either axis, so an operation can run horizontally across all the values in each row, or vertically down each column.
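As a quick toy sketch (not one of the chapter's cells; the frame and its values are made up purely for illustration), summing a small DataFrame along each axis looks like this:
import pandas as pd
df_axes = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df_axes.sum()          # down each column (axis=0, the default): a -> 3, b -> 7
df_axes.sum(axis=1)    # across each row (axis=1): 0 -> 4, 1 -> 6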
In [1]:
# import NumPy and pandas
import numpy as np
import pandas as pd
# set some pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows',10)
In [3]:
# create a DataFrame from a 2-d array
pd.DataFrame(np.array([[10,11],[20,21]]))
Out[3]:
In [4]:
# create a DataFrame from a list of Series objects
df1 = pd.DataFrame([pd.Series(np.arange(10,15)),pd.Series(np.arange(15,20))])
df1
Out[4]:
In [5]:
# what is the shape of the data frame
df1.shape
Out[5]:
In [7]:
# specify column names
df = pd.DataFrame(np.array([[10,11],[20,21]]), columns=['a','b'])
df
Out[7]:
In [9]:
# what are the names of the columns
df.columns
Out[9]:
In [10]:
# retrieve just the names of the columns by position
"{0},{1}".format(df.columns[0],df.columns[1])
Out[10]:
In [11]:
# rename the columns
df.columns = ['c','d']
df
Out[11]:
In [13]:
# create a dataframe with named rows and columns
df = pd.DataFrame(np.array([[10,11],[20,21]]),columns=['col1','col2'],index=['row1','row2'])
df
Out[13]:
In [14]:
df.index
Out[14]:
In [15]:
# create a DataFrame from two Series objects
# using a dictionary
s1 = pd.Series(np.arange(1,6,1))
s2 = pd.Series(np.arange(6,11,1))
pd.DataFrame({'c1':s1,'c2':s2})
Out[15]:
A DataFrame also performs automatic alignment of the data for each Series passed in via a dictionary. For example, the following code adds a third column in the DataFrame initialisation.
In [16]:
# demonstrate alignment during creation
s3 = pd.Series(np.arange(12,14),index=[1,2])
df = pd.DataFrame({'c1':s1,'c2':s2,'c3':s3})
df
Out[16]:
In the above example, the first two Series did not have an index specified, so they were indexed with the default labels 0..4. The third Series did specify index labels, so its values are placed in the rows with the matching labels from the previous columns: 12 and 13 land in rows 1 and 2 of c3, and the remaining rows are filled with NaN.
In [17]:
# show the first 3 lines of the file
!head -n 3 ../../data/sp500.csv
In [28]:
sp500 = pd.read_csv("../../data/sp500.csv",index_col='Symbol',usecols=[0,2,3,7])
# first five rows
sp500.head()
Out[28]:
In [20]:
# last 5 rows
sp500.tail()
Out[20]:
In [21]:
# how many rows of data
len(sp500)
Out[21]:
In [42]:
# examine the index
sp500.index
Out[42]:
In [23]:
# get the columns
sp500.columns
Out[23]:
In [24]:
# second dataset
!head -n 3 ../../data/omh.csv
In [25]:
# read in the data
one_mon_hist = pd.read_csv("../../data/omh.csv")
one_mon_hist[:3]
Out[25]:
In [36]:
# get first and second columns by position or location
sp500.columns
Out[36]:
In [47]:
type(sp500)
Out[47]:
In [62]:
sp500[sp500.columns[1]].head()
Out[62]:
In [50]:
df = sp500['Price']
df
Out[50]:
In [64]:
# create a new DataFrame with integers as the column names
# make sure to use .copy(), otherwise the change will be made in-place
df = sp500.copy()
df.columns = [0,1,2]
df.head()
Out[64]:
In [66]:
# because the column names are now integers,
# [1] is looked up as a column name, not a position
df[1]
Out[66]:
In [67]:
df.columns
Out[67]:
In [68]:
# get price column by name
# result is a series
sp500['Price']
Out[68]:
In [70]:
# get price and sector columns
# since a list is passed, the result is a DataFrame
sp500[['Price','Sector']]
Out[70]:
In [71]:
# attribute access of the column by name
sp500.Price
Out[71]:
In [72]:
loc = sp500.columns.get_loc('Price')
loc
Out[72]:
In [73]:
# first five rows
sp500[:5]
Out[73]:
In [74]:
sp500['ABT':'ACN']
Out[74]:
In [75]:
sp500.loc['ACN']
Out[75]:
In [78]:
sp500.loc[['MMM','MSFT']]
Out[78]:
In [79]:
# get rows in locations 0 and 2
sp500.iloc[[0,2]]
Out[79]:
In [80]:
# get the location of MMM and A in the index
i1 = sp500.index.get_loc('MMM')
i2 = sp500.index.get_loc('A')
"{0},{1}".format(i1,i2)
Out[80]:
In [81]:
# and get the rows
sp500.iloc[[i1,i2]]
Out[81]:
In [82]:
# by label (using .loc; the older .ix accessor is deprecated)
sp500.loc[['MSFT','ZTS']]
Out[82]:
In [84]:
# by label in both the index and column
sp500.at['MMM','Price']
Out[84]:
In [85]:
# by location. Row 0, column 1
sp500.iat[0,1]
Out[85]:
In [87]:
# what rows have a price < 100 ?
sp500.Price < 100
Out[87]:
In [88]:
# get only the Price where price is < 10 and > 0
r = sp500[(sp500.Price < 10) & (sp500.Price > 0)][['Price']]
r
Out[88]:
In [89]:
# rename the Book Value column to not have a space
# this returns a copy with the column renamed
df = sp500.rename(columns={'Book Value':'BookValue'})
# print first 2 rows
df[:2]
Out[89]:
In [90]:
df.columns
Out[90]:
In [91]:
# old dataframe remains intact
sp500.columns
Out[91]:
In [93]:
# this changes the column in-place
sp500.rename(columns={'Book Value':'BookValue'},inplace=True)
sp500.columns
Out[93]:
In [94]:
sp500.BookValue[:5]
Out[94]:
In [95]:
# make a copy
copy = sp500.copy()
copy['TwicePrice'] = sp500.Price * 2
copy[:2]
Out[95]:
In [96]:
copy = sp500.copy()
copy.insert(2,'TwicePrice',sp500.Price*2)
copy[:2]
Out[96]:
It is important to remember that this is not simply inserting a column into the DataFrame. The alignment process used here is performing a left join of the DataFrame and the Series by their index labels and then creating the column and populating data in the appropriate cell in the DataFrame.
In [97]:
# extract the first three rows and the Price column
rcopy = sp500[0:3][['Price']].copy()
rcopy
Out[97]:
In [100]:
# create a new Series to merge as a column
# one label (MMM) exists in rcopy, and MSFT does not
s = pd.Series({'MMM':'Is in the DataFrame','MSFT':'Is not in the DataFrame'})
s
Out[100]:
In [101]:
rcopy['Comment'] = s
rcopy
Out[101]:
In [102]:
# replace the Price column data with the new values
# instead of adding a new column
copy = sp500.copy()
copy.Price = sp500.Price * 2
copy[:5]
Out[102]:
In [103]:
# copy all 500 rows
copy = sp500.copy()
prices = sp500.iloc[[3,1,0]].Price.copy()
# examine the extracted prices
prices
Out[103]:
In [104]:
# now replace the Price column with prices
copy.Price = prices
# this is not really a simple insertion; it is alignment
# values are put in the correct place according to labels
copy
Out[104]:
Columns can be deleted from a DataFrame by using the del keyword, the .pop(column) method, or the .drop() method of the DataFrame.
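Before looking at each individually, here is a compact sketch of how the three differ (each call is made on its own copy; the column names are from the sp500 frame above):
c1, c2, c3 = sp500[:2].copy(), sp500[:2].copy(), sp500[:2].copy()
del c1['BookValue']              # in-place; no return value
sector = c2.pop('Sector')        # in-place; returns the removed column as a Series
c4 = c3.drop(['Price'], axis=1)  # returns a new DataFrame; c3 is unchanged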
In [105]:
# Example of using del to delete a column
# make a copy of a subset of the data frame
copy = sp500[:2].copy()
copy
Out[105]:
In [106]:
del copy['BookValue']
copy
Out[106]:
In [108]:
# Example of using pop to remove a column from a DataFrame
# first make a copy of a subset of the data frame
# pop works in-place
copy = sp500[:2].copy()
popped = copy.pop('Sector')
copy
Out[108]:
In [110]:
# and we have the Sector column as the result of the pop
popped
Out[110]:
In [111]:
# Example of using drop to remove a column
# make a copy of a subset of the DataFrame
copy = sp500[:2].copy()
afterdrop = copy.drop(['Sector'],axis=1)
afterdrop
Out[111]:
In [115]:
# copy the first three rows of sp500
df1 = sp500.iloc[0:3].copy()
# copy the rows at positions 10, 11 and 2
df2 = sp500.iloc[[10,11,2]]
# append df1 and df2
appended = df1.append(df2)
# the result is the rows of the first followed by those in the second
appended
Out[115]:
The set of columns of the DataFrame objects being appended do not need to be the same. The resulting DataFrame will consist of the union of the columns in both, and where either did not have a column, NaN will be used as the value.
In [117]:
# DataFrame using df1.index and just a PER column
# also a good example of using a scalar value
# to initialize multiple rows.
df3 = pd.DataFrame(0.0,index=df1.index,columns=['PER'])
df3
Out[117]:
In [118]:
# append df1 and df3
# each has three rows, so 6 rows is the result
# df1 had no PER column, so NaN for those rows
# df3 had no BookValue, Price or Sector, so NaN values
df1.append(df3)
Out[118]:
To append without forcing the index to be taken from either DataFrame, you can use the ignore_index=True parameter.
In [120]:
# ignore index labels, create default index
df1.append(df3,ignore_index=True)
Out[120]:
In [121]:
# copy the first 3 rows of sp500
df1 = sp500.iloc[0:3].copy()
# copy the rows at positions 10, 11 and 2
df2 = sp500.iloc[[10,11,2]]
# pass them as a list
pd.concat([df1,df2])
Out[121]:
Under the hood, pandas calculates the sorted union of distinct column names across all supplied objects, uses those as the columns, and then appends the data along the rows for each object in the order given in the list.
In [122]:
# copy df2
df2_2 = df2.copy()
# add column to df2_2 that is not in df1
df2_2.insert(3,'Foo',pd.Series(0,index=df2.index))
df2_2
Out[122]:
In [123]:
# now concatenate
pd.concat([df1,df2_2])
Out[123]:
Using the keys parameter, it is possible to differentiate the pandas objects from which the rows originated. The following code adds a level to the index that represents the source object.
In [124]:
# specify keys
r = pd.concat([df1,df2_2],keys=['df1','df2'])
r
Out[124]:
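The added level can then be used to select the rows that came from a particular source. A brief sketch using the result above:
r.loc['df1']   # only the rows that originated in df1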
We can change the axis of the concatenation to work along the columns by specifying axis=1, which calculates the sorted union of the distinct index labels from the rows and then appends the columns and their data from the specified objects.
In [131]:
# first three rows and two columns, selected by name
# (selecting by position, e.g. sp500[:3][[0,1]], raises an error)
df3 = sp500[:3][['Price','Sector']]
df3
Out[131]:
In [132]:
df4 = sp500[:3][['BookValue']]
df4
Out[132]:
In [133]:
# put them back together
pd.concat([df3,df4],axis=1)
Out[133]:
We can further examine this operation by adding a column to the second DataFrame whose name duplicates a column in the first. The result will have duplicate columns, as they are blindly appended.
In [134]:
# make a copy of df4
df4_2 = df4.copy()
# add a column to df4_2 that is also in df3
df4_2.insert(1,'Sector',pd.Series(1,index=df4_2.index))
df4_2
Out[134]:
In [135]:
# demonstrate duplicate columns
pd.concat([df3,df4_2],axis=1)
Out[135]:
pandas is performing an outer join along the labels of the specified axis. An inner join can be specified using the join='inner' parameter.
In [137]:
# first three rows and first two columns
df5 = sp500[:3][['Sector','Price']]
df5
Out[137]:
In [138]:
# rows 2 through 4 and the first two columns
df6 = sp500[2:5][['Sector','Price']]
df6
Out[138]:
In [139]:
# an inner join on index labels will result in only one row
pd.concat([df5,df6],join='inner',axis=1)
Out[139]:
In [140]:
# get a small subset of the sp500
# make sure to copy the slice so the original is not modified
ss = sp500[:3].copy()
# create a new row with index label FOO
# and assign some values to the columns via a list
ss.loc['FOO'] = ['the sector',100,110]
ss
Out[140]:
Note that the change is made in-place. If FOO already existed as an index label, then that row's data would have been replaced. This is one of the means of updating a DataFrame in-place, as .loc not only retrieves row(s), but also lets you modify the data that is returned.
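As a minimal sketch of the replacement case (the values here are arbitrary placeholders, and the columns are the same three as above):
ss = sp500[:3].copy()
# MMM already exists in the index, so this overwrites its row
# rather than appending a new one
ss.loc['MMM'] = ['a new sector', 0, 0]
ss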
It is also possible to add columns in this manner.
In [141]:
# copy of the subset/ slice
ss = sp500[:3].copy()
# add the new column initialized to 0
ss.loc[:,'PER'] = 0
ss
Out[141]:
In [142]:
# get a copy of the first 5 rows of sp500
ss = sp500[:5].copy()
ss
Out[142]:
In [143]:
# drop rows with labels ABT and ACN
afterdrop = ss.drop(['ABT','ACN'])
afterdrop
Out[143]:
In [144]:
# note that ss is not modified
ss
Out[144]:
In [145]:
# determine the rows where Price > 300
selection = sp500.Price > 300
# to make the output shorter, report the number of rows returned (500),
# and the count of rows where Price > 300 (which is 10)
"{0} {1}".format(len(selection),selection.sum())
Out[145]:
In [146]:
# select the complement
withPriceLessThan300 = sp500[~selection]
withPriceLessThan300
Out[146]:
In [147]:
# get only the first 3 rows
onlyfirstthree = sp500[:3]
onlyfirstthree
Out[147]:
In [148]:
# get only the first three rows, but as a copy this time
onlyfirstthree = sp500[:3].copy()
onlyfirstthree
Out[148]:
In [149]:
# get a subset / copy of the data
subset = sp500[:3].copy()
subset
Out[149]:
In [150]:
# change a scalar value by the labels of the row and column
subset.loc['MMM','Price'] = 0
subset
Out[150]:
.loc may suffer from lower performance compared to .iloc, due to the possibility of needing to map the label values into locations.
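In a notebook, the difference can be checked with the %timeit magic (a rough sketch; the actual numbers depend on the data and the pandas version):
%timeit sp500.loc['MMM', 'Price']   # label-based: labels must be mapped to locations
%timeit sp500.iloc[0, 1]            # position-based: a direct lookup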
In [152]:
# subset of first three rows
subset = sp500[:3].copy()
# get the location of the Price column
price_loc = sp500.columns.get_loc('Price')
# get the location of the ABT row
abt_row_loc = sp500.index.get_loc('ABT')
# change the price
subset.iloc[abt_row_loc,price_loc] = 1000
subset
Out[152]:
In [153]:
# set the seed to allow repeatable results
np.random.seed(123456)
# create the data frame
df = pd.DataFrame(np.random.randn(5,4), columns=['A','B','C','D'])
df
Out[153]:
In [154]:
# multiply everything by 2
df * 2
Out[154]:
In [155]:
# get first row
s = df.iloc[0]
# subtract first row from every row of the dataframe
diff = df - s
diff
Out[155]:
When performing an operation between a DataFrame and a Series, pandas will align the Series index along the DataFrame columns, performing what is referred to as a row-wise broadcast.
In [157]:
# subtract dataframe from series
diff = s - df
diff
Out[157]:
The set of columns returned will be the union of the labels in the index of the Series and the columns index of the DataFrame object. If a label representing a result column is not found in either the Series or the DataFrame object, then the values will be NaN filled.
The following code demonstrates this by creating a Series with an index representing a subset of the columns in the DataFrame, but also with an additional label.
In [159]:
# B, C
s2 = s[1:3]
# add E
s2['E'] = 0
# see how alignment is applied in math
df + s2
Out[159]:
In [160]:
# get rows 1 through 3 and only the B and C columns
subframe = df[1:4][['B','C']]
# we have extracted a little square in the middle of the df
subframe
Out[160]:
In [161]:
# demonstrate the alignment of the subtraction
df - subframe
Out[161]:
In [162]:
# get the A column
a_col = df['A']
df.sub(a_col, axis=0)
Out[162]:
In [163]:
# reset the index, moving it into a column
reset_sp500 = sp500.reset_index()
reset_sp500
Out[163]:
One or more columns can also be moved into the index. We can use the set_index() method for this.
In [164]:
reset_sp500.set_index('Symbol')
Out[164]:
In [165]:
# get first four rows
subset = sp500[:4].copy()
subset
Out[165]:
In [166]:
# reindex to have MMM,ABBV and FOO index labels
reindexed = subset.reindex(index=['MMM','ABBV','FOO'])
# note that ABT and ACN are dropped and FOO has NaN values
reindexed
Out[166]:
Reindexing can also be done upon the columns.
In [169]:
# reindex columns
subset.reindex(columns=['Price','BookValue','New_Column'])
Out[169]:
In [170]:
# first push symbol into a column
reindexed = sp500.reset_index()
# and now index sp500 by sector and symbol
multi_fi = reindexed.set_index(['Sector','Symbol'])
multi_fi
Out[170]:
In [171]:
# the index is a MultiIndex
# examine its type
type(multi_fi.index)
Out[171]:
In [175]:
# examine the index
multi_fi.index
Out[175]:
In [176]:
# this has 2 levels
len(multi_fi.index.levels)
Out[176]:
In [177]:
# each index level is an index
multi_fi.index.levels[0]
Out[177]:
Values of the index, at a specific level for every row, can be retrieved by the .get_level_values() method:
In [178]:
# values of the index level 0
multi_fi.index.get_level_values(0)
Out[178]:
In [179]:
# get all the stocks that are industrials
# note the result drops level 0 of the index
multi_fi.xs('Industrials')
Out[179]:
In [181]:
# select rows where level 1 is ALLE
# note that the sector level is dropped from the result
multi_fi.xs('ALLE', level=1)
Out[181]:
In [182]:
# Industrials, without dropping the level
multi_fi.xs('Industrials', drop_level=False)
Out[182]:
In [183]:
# drill through the levels
multi_fi.xs('Industrials').xs('UPS')
Out[183]:
In [185]:
# drill through using tuples
multi_fi.xs(('Industrials','UPS'))
Out[185]:
pandas provides several classes of statistical operations that can be applied to a Series or DataFrame object. These reductive methods, when applied to a Series, result in a single value.
When applied to a DataFrame, an axis can be specified, and the method is then applied to each column or row, resulting in a Series.
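As a toy sketch of the reduction (the values are made up and computed by hand, not taken from the data files):
demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
demo['x'].mean()     # a Series reduces to a single value: 2.0
demo.mean()          # a DataFrame reduces to a Series: x -> 2.0, y -> 5.0
demo.mean(axis=1)    # per row: 0 -> 2.5, 1 -> 3.5, 2 -> 4.5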
In [186]:
# calc the mean of the values in each column
one_mon_hist.mean()
Out[186]:
In [187]:
# calc the mean of the values in each row
one_mon_hist.mean(axis=1)
Out[187]:
In [188]:
# calc the variance of the values in each column
one_mon_hist.var()
Out[188]:
In [189]:
# calc the median of the values in each column
one_mon_hist.median()
Out[189]:
In [190]:
# the minimum prices for both stocks
one_mon_hist[['MSFT','AAPL']].min()
Out[190]:
In [191]:
# and the maximum prices
one_mon_hist[['MSFT','AAPL']].max()
Out[191]:
Some pandas statistical methods are referred to as indirect statistics; for example, .idxmin() and .idxmax() return the index labels where the minimum and maximum values exist, respectively.
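A toy sketch (made-up values) makes the distinction clear: these methods return the index label of the extreme value, not the value itself:
s_toy = pd.Series([10, 30, 20], index=['a', 'b', 'c'])
s_toy.idxmax()   # 'b' -- the label where the max (30) lives
s_toy.idxmin()   # 'a' -- the label where the min (10) lives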
In [192]:
# location of the min price for both stocks
one_mon_hist[['MSFT','AAPL']].idxmin()
Out[192]:
In [193]:
one_mon_hist[['MSFT','AAPL']].idxmax()
Out[193]:
In [195]:
# find the mode of the Series
s = pd.Series([1,2,3,3,5])
s.mode()
Out[195]:
In [196]:
# there can be more than one mode
s = pd.Series([1,2,3,3,5,1])
s.mode()
Out[196]:
Accumulations in pandas are statistical methods that determine a value by continuously applying the next value in a Series to the current result.
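Conceptually, .cumsum() keeps a running total. A hand-rolled sketch of the same idea (for illustration only; the built-in method is the one to use):
values = pd.Series([1, 2, 3, 4])
total, running = 0, []
for v in values:
    total += v            # apply the next value to the current result
    running.append(total)
pd.Series(running)        # 1, 3, 6, 10 -- the same as values.cumsum()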
In [197]:
# calculate a cumulative product
pd.Series([1,2,3,4]).cumprod()
Out[197]:
In [198]:
# calculate a cumulative sum
pd.Series([1,2,3,4]).cumsum()
Out[198]:
In [199]:
# summary statistics
one_mon_hist.describe()
Out[199]:
Non-numerical data will result in a slightly different set of summary statistics (count, unique, top, and freq).
In [202]:
# get summary statistics on non-numeric data
s = pd.Series(['a','a','b','c',np.nan])
s.describe()
Out[202]:
In [201]:
# get the count of non-NaN values
s.count()
Out[201]:
In [203]:
# return the unique values (as an array)
s.unique()
Out[203]:
In [204]:
# number of occurrences of each unique value
s.value_counts()
Out[204]: